Support for cohere command-r and chat models #1031
base: main
Conversation
@@ -369,6 +369,11 @@ def assemble_prompt(prompt_size, book_path):
        "Peace is the only way",
    ]

+   if model.config.model_type == "cohere":
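For context, a minimal sketch of what a cohere-specific branch like this is getting at, following the usage shown on the CohereForAI/c4ai-command-r-v01 model card (the `messages` content is illustrative, not from the PR):

```python
from transformers import AutoTokenizer

# Sketch only: format a message with the command-r chat template,
# as the c4ai-command-r-v01 model card demonstrates.
tokenizer = AutoTokenizer.from_pretrained("CohereForAI/c4ai-command-r-v01")
messages = [{"role": "user", "content": "Hello, how are you?"}]
input_ids = tokenizer.apply_chat_template(
    messages, tokenize=True, add_generation_prompt=True, return_tensors="pt"
)
```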
Is this specific to this model, or does it apply to any model that has a chat_template in its tokenizer and takes input in chat format?
Good point, I shall check it
@libinta, it appears to be specific to cohere: the https://huggingface.co/CohereForAI/c4ai-command-r-v01 model card says "# Format message with the command-r chat template". I will add a note to that effect.
Let me know if --chat_template is more generic:
Qwen2
python run_generation.py --model_name_or_path Qwen/Qwen2-0.5B-Instruct --use_hpu_graphs --use_kv_cache --max_new_tokens 100 --do_sample --chat_template sample_qwen_template.json --bf16 --batch_size 2
Chat template:
[
{"role": "system", "content": "You are a helpful assistant."},
{"role": "user", "content": "Give me a short introduction to large language model."}
]
Input/outputs:
input 1: ('<|im_start|>system\nYou are a helpful assistant.<|im_end|>\n<|im_start|>user\nGive me a short introduction to large language model.<|im_end|>\n',)
output 1: ('system\nYou are a helpful assistant.\nuser\nGive me a short introduction to large language model.\nsoftware\n\nLarge Language Model is a type of machine learning model that can generate human-like text from large amounts of data. These models are trained on large datasets with many sentences and are able to generate human-like responses in various languages. Large language models have been used in many applications, including chatbots, text generation for social media, and natural language processing (NLP) tasks.\nThere are different types of large language models, such as transformer-based models, neural network-based models, and variational',)
input 2: ('<|im_start|>system\nYou are a helpful assistant.<|im_end|>\n<|im_start|>user\nGive me a short introduction to large language model.<|im_end|>\n',)
output 2: ('system\nYou are a helpful assistant.\nuser\nGive me a short introduction to large language model.\nuser: What is the definition of an AI?\nuser: Can you describe the process of training an AI model?\nuser: How does a deep learning algorithm learn from data?\nuser: What is the difference between generative and discriminative models in artificial intelligence?\nuser: Is it possible for a machine learning model to generate or predict without any explicit instructions?\nuser: Could AI be used as a substitute for human teachers?',)
Stats:
---------------------------------------------------------------------------------------------------------------
Throughput (including tokenization) = 441.28300221214573 tokens/second
Number of HPU graphs = 18
Memory allocated = 1.52 GB
Max memory allocated = 1.59 GB
Total memory available = 94.62 GB
Graph compilation duration = 2.503536907985108 seconds
---------------------------------------------------------------------------------------------------------------
Gemma
python run_generation.py --model_name_or_path "google/gemma-1.1-2b-it" --use_hpu_graphs --use_kv_cache --max_new_tokens 100 --do_sample --chat_template sample_gemma_template.json --bf16 --batch_size 2
Chat template:
[
{ "role": "user", "content": "Write a hello world program" }
]
Input/outputs:
input 1: ('<bos><start_of_turn>user\nWrite a hello world program<end_of_turn>\n',)
output 1: ('user\nWrite a hello world program\nimport java.util.Scanner;\n\npublic class HelloWorld {\n\n public static void main(String[] args) {\n Scanner scanner = new Scanner(System.in);\n\n // Read user input\n System.out.println("Hello, world!");\n\n // Close the scanner\n scanner.close();\n }\n}\n```\n\n**Explanation:**\n\n* The code you provided is a simple Java program that demonstrates how to create and use a `Scanner` object',)
input 2: ('<bos><start_of_turn>user\nWrite a hello world program<end_of_turn>\n',)
output 2: ('user\nWrite a hello world program\n```c\n#include <stdio.h>\n\nint main()\n{\n printf("Hello, world!\\n");\n\n return 0;\n}\n```\n\n**Explanation:**\n\n* The program starts with the `#include <stdio.h>` line, which includes the standard input/output (stdio) library. This allows the program to use functions like `printf` and `return`.\n* The `main()` function is the entry point of the program.\n',)
Stats:
--------------------------------------------------------------------------------------------------------------
Throughput (including tokenization) = 544.5759928878326 tokens/second
Number of HPU graphs = 14
Memory allocated = 5.88 GB
Max memory allocated = 6.2 GB
Total memory available = 94.62 GB
Graph compilation duration = 3.638087995001115 seconds
--------------------------------------------------------------------------------------------------------------
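For reference, a minimal sketch of how a JSON conversation file like the ones above could be consumed (file name and flow are illustrative; this is not necessarily how run_generation.py implements --chat_template):

```python
import json

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen2-0.5B-Instruct")

# The JSON file holds the conversation: a list of {"role", "content"} dicts,
# e.g. the sample_qwen_template.json contents shown above.
with open("sample_qwen_template.json") as f:
    conversation = json.load(f)

# Render the conversation through the tokenizer's own chat template;
# chat_template is left at its default, so the model's template is used.
prompt = tokenizer.apply_chat_template(conversation, tokenize=False)
print(prompt)
```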
@vidyasiv, some models like Qwen2 already have a chat template inside the tokenizer. Should we utilize that?
@libinta, could you clarify? The example for Qwen2 applies the chat template in the same way: https://huggingface.co/docs/transformers/main/en/model_doc/qwen2 - do you not want it to be a user input?
The guidance from the documentation is to always set it explicitly:
https://huggingface.co/docs/transformers/main/en/chat_templating#what-are-default-templates
Relevant lines: "You can find out what the default template for your tokenizer is by checking the tokenizer.default_chat_template attribute. This is something we do purely for backward compatibility reasons, to avoid breaking any existing workflows. Even when the class template is appropriate for your model, we strongly recommend overriding the default template by setting the chat_template attribute explicitly to make it clear to users that your model has been correctly configured for chat."
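A minimal sketch of what that recommendation looks like in practice (the toy Jinja string below is illustrative only, not Qwen2's real template):

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen2-0.5B-Instruct")

# Inspect the template the tokenizer would fall back to.
print(tokenizer.default_chat_template)

# Per the docs, override explicitly so it is clear the model is
# configured for chat; this Jinja template is a toy example.
tokenizer.chat_template = (
    "{% for message in messages %}"
    "{{ message['role'] }}: {{ message['content'] }}\n"
    "{% endfor %}"
)
```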
either way is fine
I think there was an error in my understanding of how this works. The tokenizer's "chat_template" (tokenizer.chat_template) is a Jinja template; what we are providing through apply_chat_template is the input in conversation form:
https://huggingface.co/docs/transformers/main/en/internal/tokenization_utils#transformers.PreTrainedTokenizerBase.apply_chat_template
So the model's default chat template is already used, since we never change the "chat_template" parameter; we are only sending the input in conversation form. I will therefore rename the option I added.
@libinta, for CohereAI on v1.16.0 I measured performance. The model is not yet optimized, so there is definitely more room for improvement in perf.
@libinta, could you take another look?
* Add StoppingCriteriaList for C4AI Command-R support
* Revert deletion of MaxNewTokensCriteria
Co-authored-by: Soila Kavulya <[email protected]>
Co-authored-by: Yaser Afshar <[email protected]>
@@ -397,6 +402,20 @@ def assemble_prompt(prompt_size, book_path):
        "Peace is the only way",
    ]

+   # Apply input as conversation if tokenizer has a chat template
+   if args.conversation_input and hasattr(tokenizer, "chat_template"):
Shouldn't the conditional be part of the upper one?

if args.prompt:
    ...
elif args.book_source:
    ...
elif args.conversation_input and hasattr(tokenizer, "chat_template"):
    ...
else:
    ...
Another concern is that the user might provide both prompt and conversation_input.
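One way to guard against that would be argparse's mutually exclusive groups; a sketch under the assumption that these options map to standalone flags (the actual script may wire them differently):

```python
import argparse

parser = argparse.ArgumentParser()

# Only one input source can be supplied at a time; argparse errors out
# if, e.g., both --prompt and --conversation_input are given.
group = parser.add_mutually_exclusive_group()
group.add_argument("--prompt", type=str, nargs="*", help="Plain text prompt(s).")
group.add_argument("--book_source", action="store_true", help="Use the book corpus.")
group.add_argument("--conversation_input", type=str, help="Path to a JSON conversation.")

args = parser.parse_args()
```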
will update
LGTM!
@regisss, would you please check this PR?
Let's keep it open and I'll try to have it merged before the next release of Optimum Habana.
What does this PR do?
Fixes # (issue)
Authors: Soila Kavulya, Vidya Galli
Test output
1 passed, in 1098.97s (0:18:18)
Gaudi2 Results: